Subgroup Operations #4190

Closed
wants to merge 53 commits into from


exrook
Contributor

@exrook exrook commented Sep 30, 2023

Checklist

  • Run cargo clippy.
  • Run cargo clippy --target wasm32-unknown-unknown if applicable.
  • Add change to CHANGELOG.md. See simple instructions inside file.

Connections
merged with gfx-rs/naga#2523
Closes: #4428

Description
Allows WGSL shaders to perform subgroup operations.

Adds wgpu features:

  • SUBGROUP_COMPUTE
  • SUBGROUP_FRAGMENT
  • SUBGROUP_VERTEX

Each feature enables BASIC, VOTE, ARITHMETIC, BALLOT, SHUFFLE, and SHUFFLE_RELATIVE operations to be used in its respective stage. See #4190 (comment) for a short discussion of the portability of this feature set.

Adds new naga capability SUBGROUP, required for a shader to use subgroup builtin functions or parameters.

Adds new naga validator settings subgroup_operations and subgroup_stages determining which sets of the below subgroup operations are valid, and which stages they are valid in.

BASIC operations:

# Performs a control and memory barrier across all invocations in the subgroup
subgroupBarrier()

VOTE operations:

subgroupAll(bool) -> bool
subgroupAny(bool) -> bool

ARITHMETIC operations:

# Operations on scalars and vectors of f32, i32, u32:
# Computes a single result value using values from all active lanes
subgroupAdd(value) -> value
subgroupMul(value) -> value
subgroupMin(value) -> value
subgroupMax(value) -> value
# Operations on scalars and vectors of i32, u32:
# Computes a single result value using values from all active lanes
subgroupAnd(value) -> value
subgroupOr(value) -> value
subgroupXor(value) -> value
# Computes a prefix scan across all active lanes
subgroupPrefixInclusiveAdd(value) -> value
subgroupPrefixInclusiveMul(value) -> value
subgroupPrefixExclusiveAdd(value) -> value
subgroupPrefixExclusiveMul(value) -> value
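
For illustration only (not part of this PR): a minimal CPU model of the reduction and prefix-scan semantics described above. The function names and the `Option<u32>` encoding of active/inactive lanes are assumptions made for this sketch; they are not naga or wgpu APIs.

```rust
// Hypothetical CPU model of a subgroup: `lanes[i]` is Some(v) for an
// active lane holding value v, and None for an inactive lane.

/// Models subgroupAdd: every active lane receives the sum over all active lanes.
fn subgroup_add(lanes: &[Option<u32>]) -> Vec<Option<u32>> {
    let total: u32 = lanes.iter().flatten().sum();
    lanes.iter().map(|&l| l.map(|_| total)).collect()
}

/// Models subgroupPrefixInclusiveAdd: an active lane receives the sum of
/// the values of all active lanes up to and including itself.
fn subgroup_prefix_inclusive_add(lanes: &[Option<u32>]) -> Vec<Option<u32>> {
    let mut running = 0u32;
    lanes
        .iter()
        .map(|&l| {
            l.map(|v| {
                running += v;
                running
            })
        })
        .collect()
}

/// Models subgroupPrefixExclusiveAdd: an active lane receives the sum of
/// the values of all active lanes strictly before itself.
fn subgroup_prefix_exclusive_add(lanes: &[Option<u32>]) -> Vec<Option<u32>> {
    let mut running = 0u32;
    lanes
        .iter()
        .map(|&l| {
            l.map(|v| {
                let before = running;
                running += v;
                before
            })
        })
        .collect()
}
```

In this model inactive lanes contribute nothing and receive nothing, which is the "all active lanes" wording used in the comments above.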

BALLOT operations:

# Computes a result using a single bit from every active lane
subgroupBallot() -> vec4<u32>
subgroupBallot(bool) -> vec4<u32>
# Operations on scalars and vectors of f32, i32, u32:
# Reads a value from the first active lane into every other active lane
subgroupBroadcastFirst(value) -> value
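
For illustration only: a sketch of how a ballot result could be packed, assuming lane i maps to bit i % 32 of word i / 32 in the vec4<u32>. The helper names and the `Option` encoding of inactive lanes are hypothetical.

```rust
/// Models subgroupBallot(pred): bit i of the 128-bit result is set when
/// lane i is active and its predicate is true. The four u32 words stand in
/// for vec4<u32>. `None` marks an inactive lane.
fn subgroup_ballot(preds: &[Option<bool>]) -> [u32; 4] {
    let mut words = [0u32; 4];
    for (lane, pred) in preds.iter().enumerate().take(128) {
        if *pred == Some(true) {
            words[lane / 32] |= 1u32 << (lane % 32);
        }
    }
    words
}

/// Models subgroupBroadcastFirst: every active lane reads the value held
/// by the lowest-numbered active lane.
fn subgroup_broadcast_first(lanes: &[Option<u32>]) -> Vec<Option<u32>> {
    let first = lanes.iter().flatten().next().copied();
    lanes.iter().map(|&l| l.and(first)).collect()
}
```

Under this model the argument-less subgroupBallot() corresponds to passing `Some(true)` for every active lane, so the result is simply the active-lane mask.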

SHUFFLE operations:

# Operations on scalars and vectors of f32, i32, u32:
# Reads a value from the lane given by index into the current lane, index may vary per lane
subgroupBroadcast(value, index) -> value
# As above, but the index is computed from the current lane id XOR index_mask
subgroupShuffleXor(value, index_mask) -> value

SHUFFLE_RELATIVE operations:

# Reads a value from the lane with id = current lane id +/- `offset`
subgroupShuffleUp(value, index_offset) -> value
subgroupShuffleDown(value, index_offset) -> value
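
For illustration only: a CPU sketch of the lane indexing behind the shuffle operations, assuming all lanes are active. Reading from a lane outside the subgroup is modelled as `None` here; that choice, like the function names, is an assumption of this sketch rather than something the PR specifies.

```rust
/// Generic shuffle: lane i reads `values[source(i)]`.
/// `None` models a source lane that falls outside the subgroup.
fn shuffle_by(values: &[u32], source: impl Fn(usize) -> Option<usize>) -> Vec<Option<u32>> {
    (0..values.len())
        .map(|lane| source(lane).and_then(|src| values.get(src).copied()))
        .collect()
}

/// Models subgroupShuffleXor: lane i reads from lane (i XOR mask).
fn subgroup_shuffle_xor(values: &[u32], mask: usize) -> Vec<Option<u32>> {
    shuffle_by(values, |lane| Some(lane ^ mask))
}

/// Models subgroupShuffleUp: lane i reads from lane (i - offset).
fn subgroup_shuffle_up(values: &[u32], offset: usize) -> Vec<Option<u32>> {
    shuffle_by(values, |lane| lane.checked_sub(offset))
}

/// Models subgroupShuffleDown: lane i reads from lane (i + offset).
fn subgroup_shuffle_down(values: &[u32], offset: usize) -> Vec<Option<u32>> {
    shuffle_by(values, |lane| lane.checked_add(offset))
}
```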

New builtins:

# available in any stage, subject to device support
subgroup_invocation_id: u32
subgroup_size: u32
# available only in compute stage
num_subgroups: u32
subgroup_id: u32

Testing
Naga snapshot tests for wgsl, spv input
wgpu gpu test exercising the feature

Thanks @Lichtso for many contributions to this work!

@JMS55
Contributor

JMS55 commented Oct 1, 2023

I'm personally super excited for subgroup support, thank you so much for working on it! If you're interested, the other parts of subgroup support I would find valuable are:

  • VK_EXT_subgroup_size_control
  • Reduction operations (summation, etc)
  • Scan operations (inclusive/exclusive prefix sum, etc)

@cwfitzgerald
Member

To give you a bit of help in the correct direction:

Whether we should split up features depends on what devices in the wild already support. You can check the Metal feature tables and the D3D documentation, and use this tool https://github.com/kainino0x/gpuinfo-vulkan-query to determine how the subgroup ops should be grouped.

Additionally, definitely write some tests that check the functionality. It's the surest way to make sure things are working cross-platform.

@cwfitzgerald
Member

Oh and ofc, this is awesome! Thank you for working on it!

@Lichtso
Contributor

Lichtso commented Oct 11, 2023

First of all, thank you for working on this. It is a highly anticipated feature for me.
If you need help, I would like to contribute as well.

About the feature fragmentation, looking at what the other APIs did:

  • Vulkan 1.2 / SPIRV 1.3 subgroup-ops: VK_SUBGROUP_FEATURE_BASIC_BIT, VK_SUBGROUP_FEATURE_VOTE_BIT, VK_SUBGROUP_FEATURE_ARITHMETIC_BIT, VK_SUBGROUP_FEATURE_BALLOT_BIT, VK_EXT_subgroup_size_control
  • Metal 2.2 SIMD-scoped operations: "permute", "reduction", "barrier"; there is only a single kind of GPU (the one in the A13 chip) that does not support all three together
  • OpenGL 4.3: GL_EXT_shader_group_vote, GL_ARB_shader_ballot
  • DirectX (shader model 6): Wave-ops: D3D12_FEATURE_DATA_D3D12_OPTIONS1.WaveOps

So Vulkan has the most feature fragmentation and DirectX the least. I would personally prefer only a single feature to cover all operations, because these are all related (by the concept of subgroups) and splitting them does not make it easier for developers. The trade-off being that some devices which could be partially supported won't be supported at all, but it seems such devices are rather rare.

@Lichtso
Contributor

Lichtso commented Oct 11, 2023

@exrook I continued your work on this branch: https://github.com/Lichtso/wgpu/tree/subgroup_operations

@exrook exrook force-pushed the subgroup_feature branch 6 times, most recently from b5a67eb to 6dbd21e Compare October 22, 2023 23:15
@exrook exrook changed the title Add feature for subgroup ballot in fragment and compute shaders Add feature for subgroup operations in fragment and compute shaders Oct 23, 2023
@exrook exrook force-pushed the subgroup_feature branch 2 times, most recently from 59c44dc to f67005e Compare October 23, 2023 02:24
@exrook
Contributor Author

exrook commented Oct 24, 2023

I've decided to provide one feature bit per shader stage to require support for subgroup operations. i.e. we'll have:

  • SUBGROUP_COMPUTE
  • SUBGROUP_FRAGMENT
  • SUBGROUP_VERTEX

Each feature bit requires support for the following operations, in Vulkan terms:
BASIC, VOTE, ARITHMETIC, BALLOT, SHUFFLE, SHUFFLE_RELATIVE

annotated gpuinfo vulkan query results

Operations basic, vote, arithmetic, ballot, shuffle, shuffle_relative

Judging from the results below, the underlying hardware all supports these operations; the failing reports are just from older driver versions.

Requirement "subgroupSupportedOperations has bits 0b111111" loses 9 (and partially loses 18) further deviceNames:
  In ALL reports (9 deviceNames):
    x Intel(R) HD Graphics (HSW GT1): 2 (10452 15901)
    x Intel(R) HD Graphics 5300: 1 (19258)
    x Intel(R) HD Graphics P4600/P4700 (HSW GT2): 2 (18770 18883)
    x Intel(R) Iris(R) Pro Graphics 5200 (HSW GT3): 1 (18461)
    x Intel(R) Iris(TM) Graphics 6100: 4 (17664 17687 18406 19067)
    x NVIDIA GeForce GT 640M: 1 (19121)
    x NVIDIA GeForce GTX 680MX: 1 (17144)
    x NVIDIA GeForce GTX 775M: 1 (18842)
    x Virtio-GPU Venus (llvmpipe): 2 (14801 14867)
  In SOME reports (18 deviceNames):
    ~ Adreno (TM) 505: 1 of 30 (10174; ok: 4259 4457 4489 4824 6129 7394 7550 8516 8922 9025 9063 9355 9501 9798 10776 11057 11789 12203 12269 12277 12342 13211 14230 14771 14913 16201 16983 18395 19093)
    ~ Intel HD Graphics 4000: 1 of 2 (18923; ok: 9638)
    ~ Intel Iris Graphics: 1 of 2 (19397; ok: 9665)
    ~ Intel Iris Pro Graphics: 3 of 12 (17044 17590 17778; ok: 7063 9812 9814 9816 11362 11481 12653 13573 14151)
    ~ Intel(R) HD Graphics (BYT): 4 of 10 (10325 11896 17121 17921; ok: 7808 8050 8437 8527 8735 9821)
    ~ Intel(R) HD Graphics 2500 (IVB GT1): 12 of 14 (11199 12017 12759 13563 14644 16444 16695 16753 18125 18419 19092 19136; ok: 8863 10001)
    ~ Intel(R) HD Graphics 4000 (IVB GT2): 25 of 35 (10270 10349 10415 11073 11323 11356 11797 12426 12428 12475 13314 13832 14366 14553 14791 15063 15598 16328 16634 17079 17793 18241 18723 18852 19272; ok: 7904 8299 8355 8723 9007 9101 9152 9421 9721 10047)
    ~ Intel(R) HD Graphics 4400 (HSW GT2): 20 of 25 (12060 13517 13551 13561 13679 13826 13870 14021 14023 14177 14293 14502 14517 14682 15078 15131 15588 17280 17428 19019; ok: 7859 8565 8886 9156 10164)
    ~ Intel(R) HD Graphics 4600 (HSW GT2): 12 of 22 (12242 12556 12675 13189 13465 15447 15925 17786 17868 18148 18201 19682; ok: 8018 8030 8067 8069 8089 8247 8357 8666 9720 9795)
    ~ Intel(R) Iris(R) Pro Graphics P5200 (HSW GT3): 1 of 2 (10783; ok: 9912)
    ~ Intel(R) Iris(TM) Graphics 6000: 3 of 4 (18448 18725 19301; ok: 11332)
    ~ Intel(R) Iris(TM) Pro Graphics 6200: 1 of 2 (18412; ok: 19257)
    ~ Mali-G52: 1 of 36 (15490; ok: 9942 10194 11056 11982 12046 12062 12284 12622 12630 12643 12798 14045 14925 15260 15378 15727 15736 15996 16385 16469 16506 16992 17167 17176 17292 17649 17729 18106 18134 18226 18280 18582 18589 18744 19671)
    ~ NVIDIA GeForce GT 750M: 5 of 9 (17043 17768 18304 18632 19335; ok: 7062 9811 9813 9815)
    ~ Turnip Adreno (TM) 640: 2 of 14 (14106 14285; ok: 14550 14798 15940 16124 16511 16629 16848 16853 16881 16945 17877 19410)
    ~ Turnip Adreno (TM) 650: 6 of 56 (14102 14104 14119 14231 14280 14809; ok: 14543 14612 14702 14795 14920 14931 14940 15470 15779 15969 16012 16013 16039 16054 16198 16232 16234 16346 16347 16400 16460 16508 16510 16513 16546 16599 16609 16656 16718 16738 17011 17012 17114 17135 17187 17431 17495 17984 18014 18015 18361 18481 18519 18909 18992 19026 19152 19275 19369 19505)
    ~ Turnip Adreno (TM) 660: 2 of 21 (14269 15625; ok: 15328 15722 15782 16006 16016 16189 16455 16514 16515 16621 16668 16808 16947 17083 17095 17163 18888 18896 19339)
    ~ llvmpipe: 62 of 274 (13810 13838 13864 14050 14115 14236 14338 14348 14516 14538 14542 14556 14567 14570 14571 14633 14657 14683 14695 14707 14721 14727 14734 14737 14750 14785 14797 14810 14821 14843 14847 14870 14886 14889 14902 14909 14919 14951 14971 14975 15011 15014 15027 15046 15058 15071 15088 15098 15121 15136 15195 15215 15227 15239 15249 15276 15279 15309 15431 15488 17059 17269; ok: 15092 15285 15489 15568 15639 15695 15799 15809 15811 15847 15922 15974 15990 16022 16041 16044 16064 16068 16069 16103 16116 16119 16139 16149 16173 16180 16192 16216 16240 16244 16250 16268 16272 16277 16299 16337 16340 16349 16370 16374 16384 16395 16420 16423 16457 16467 16472 16481 16486 16495 16502 16504 16519 16525 16534 16554 16561 16572 16581 16598 16627 16643 16673 16678 16699 16702 16725 16780 16798 16816 16838 16852 16866 16878 16892 16909 16939 16964 16975 17020 17041 17054 17069 17075 17103 17124 17162 17185 17197 17215 17243 17257 17301 17353 17378 17394 17424 17445 17483 17503 17509 17541 17569 17601 17617 17672 17708 17727 17745 17763 17802 17807 17822 17849 17871 17899 17926 17961 17971 17989 17993 17996 18004 18008 18047 18061 18089 18093 18117 18130 18151 18155 18160 18179 18222 18228 18246 18258 18301 18309 18358 18418 18424 18434 18470 18492 18495 18506 18535 18548 18555 18557 18558 18561 18597 18633 18638 18665 18671 18689 18708 18726 18737 18779 18821 18827 18836 18847 18858 18880 18885 18906 18913 18937 18949 18963 18984 19008 19023 19032 19054 19064 19072 19087 19100 19107 19142 19156 19251 19264 19280 19315 19324 19350 19363 19414 19418 19440 19461 19491 19545 19550 19564 19596 19599 19634 19640 19662 19684 19696 19732 19736)
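
For illustration only: the requirement "subgroupSupportedOperations has bits 0b111111" used in the query above is a plain bitmask test. The sketch below uses the bit positions of VkSubgroupFeatureFlagBits from the Vulkan specification; the helper names are hypothetical and this is not code from the PR.

```rust
// Bit positions of VkSubgroupFeatureFlagBits in the Vulkan specification.
const SUBGROUP_OPERATIONS: [(&str, u32); 8] = [
    ("BASIC", 1 << 0),
    ("VOTE", 1 << 1),
    ("ARITHMETIC", 1 << 2),
    ("BALLOT", 1 << 3),
    ("SHUFFLE", 1 << 4),
    ("SHUFFLE_RELATIVE", 1 << 5),
    ("CLUSTERED", 1 << 6),
    ("QUAD", 1 << 7),
];

// The six operation classes the query above requires: the low six bits.
const REQUIRED_OPERATIONS: u32 = 0b111111;

/// True when every required operation bit is present in `supported`.
fn supports_required_operations(supported: u32) -> bool {
    (supported & REQUIRED_OPERATIONS) == REQUIRED_OPERATIONS
}

/// Names of the required operation classes that `supported` lacks.
fn missing_operations(supported: u32) -> Vec<&'static str> {
    SUBGROUP_OPERATIONS
        .iter()
        .filter(|&&(_, bit)| (REQUIRED_OPERATIONS & bit) != 0 && (supported & bit) == 0)
        .map(|&(name, _)| name)
        .collect()
}
```

CLUSTERED (0b1000000) and QUAD (0b10000000) sit outside the required mask, so a device lacking only those still passes.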

Compute stage support

These results are after removing devices lacking support for our selected operations. If a device supports any subgroup operations at all, they must be supported in the compute stage according to the Vulkan specification anyway.

Requirement "subgroupSupportedStages has bits 0b100000" loses no further reports!

Fragment stage support

Requirement "subgroupSupportedStages has bits 0b10000" loses 8 (and partially loses 0) further deviceNames:
  In ALL reports (8 deviceNames):
    x Turnip Adreno (TM) 610: 1 (19343)
    x Turnip Adreno (TM) 618: 4 (14703 19399 19409 19675)
    x Turnip Adreno (TM) 619: 1 (16229)
    x Turnip Adreno (TM) 630: 8 (14614 15190 15899 16477 16537 16670 19167 19192)
    x Turnip Adreno (TM) 640: 12 (14550 14798 15940 16124 16511 16629 16848 16853 16881 16945 17877 19410)
    x Turnip Adreno (TM) 650: 50 (14543 14612 14702 14795 14920 14931 14940 15470 15779 15969 16012 16013 16039 16054 16198 16232 16234 16346 16347 16400 16460 16508 16510 16513 16546 16599 16609 16656 16718 16738 17011 17012 17114 17135 17187 17431 17495 17984 18014 18015 18361 18481 18519 18909 18992 19026 19152 19275 19369 19505)
    x Turnip Adreno (TM) 740: 1 (19591)
    x Turnip Adreno (TM) 7Adreno (TM) 740SX_SECOND_VECTOR: 1 (19289)
  In SOME reports (0 deviceNames):

Vertex stage support

These also seem to only lack support due to old drivers. To be clear, this is after devices have been removed by the previous requirements.

Requirement "subgroupSupportedStages has bits 0b1" loses 16 (and partially loses 23) further deviceNames:
  In ALL reports (16 deviceNames):
    x AMD Radeon Pro 455: 1 (19237)
    x AMD Radeon Pro 5500 XT: 1 (19703)
    x AMD Radeon Pro 555: 2 (19182 19721)
    x AMD Radeon Pro 560: 3 (16847 17783 18667)
    x AMD Radeon Pro 570: 1 (18186)
    x AMD Radeon Pro Vega 20: 2 (16905 17497)
    x AMD Radeon Pro Vega 48: 1 (18502)
    x AMD Radeon Pro W5700X: 1 (18292)
    x AMD Radeon RX 570: 1 (17293)
    x Apple M1 Ultra: 4 (17092 17381 17879 18338)
    x Apple M2 Max: 1 (18789)
    x Apple M2 Pro: 1 (19130)
    x Intel(R) Iris(TM) Plus Graphics 650: 1 (16944)
    x Intel(R) Iris(TM) Plus Graphics 655: 1 (19338)
    x Mali-G715-Immortalis MC11: 2 (17913 17955)
    x llvmpipe: 212 (15092 15285 15489 15568 15639 15695 15799 15809 15811 15847 15922 15974 15990 16022 16041 16044 16064 16068 16069 16103 16116 16119 16139 16149 16173 16180 16192 16216 16240 16244 16250 16268 16272 16277 16299 16337 16340 16349 16370 16374 16384 16395 16420 16423 16457 16467 16472 16481 16486 16495 16502 16504 16519 16525 16534 16554 16561 16572 16581 16598 16627 16643 16673 16678 16699 16702 16725 16780 16798 16816 16838 16852 16866 16878 16892 16909 16939 16964 16975 17020 17041 17054 17069 17075 17103 17124 17162 17185 17197 17215 17243 17257 17301 17353 17378 17394 17424 17445 17483 17503 17509 17541 17569 17601 17617 17672 17708 17727 17745 17763 17802 17807 17822 17849 17871 17899 17926 17961 17971 17989 17993 17996 18004 18008 18047 18061 18089 18093 18117 18130 18151 18155 18160 18179 18222 18228 18246 18258 18301 18309 18358 18418 18424 18434 18470 18492 18495 18506 18535 18548 18555 18557 18558 18561 18597 18633 18638 18665 18671 18689 18708 18726 18737 18779 18821 18827 18836 18847 18858 18880 18885 18906 18913 18937 18949 18963 18984 19008 19023 19032 19054 19064 19072 19087 19100 19107 19142 19156 19251 19264 19280 19315 19324 19350 19363 19414 19418 19440 19461 19491 19545 19550 19564 19596 19599 19634 19640 19662 19684 19696 19732 19736)
  In SOME reports (23 deviceNames):
    ~ AMD Radeon Pro 5300M: 4 of 9 (17089 18227 18809 19516; ok: 10099 10103 12130 13339 13860)
    ~ AMD Radeon Pro 5500M: 5 of 6 (17321 18370 18402 18876 19496; ok: 12173)
    ~ AMD Radeon Pro 555X: 1 of 5 (18851; ok: 12248 12834 13632 18849)
    ~ AMD Radeon Pro 560X: 3 of 4 (16995 17003 18525; ok: 16782)
    ~ AMD Radeon Pro 580X: 1 of 2 (19109; ok: 10495)
    ~ AMD Radeon RX 580: 1 of 2 (18921; ok: 11047)
    ~ AMD Radeon RX 6600: 2 of 11 (18663 19345; ok: 12796 13253 14295 14943 15928 17339 17513 19296 19530)
    ~ AMD Radeon RX 6800 XT: 1 of 21 (18291; ok: 10010 10710 11111 11563 11990 12279 12889 13194 13613 14095 15060 15613 16040 16352 16880 17741 18043 18512 18661 19171)
    ~ AMD Radeon RX Vega 64: 1 of 18 (18263; ok: 5659 6898 9626 9630 9646 10079 10245 11293 11294 12289 13161 14191 14837 15034 15539 15540 16135)
    ~ Adreno (TM) 730: 9 of 88 (17654 17840 17841 18073 18090 18530 19101 19382 19709; ok: 13441 13717 13731 13905 14181 14215 14220 14270 14277 14397 14481 14487 14511 14524 14905 14988 15402 15434 15453 15574 15646 15760 15771 15781 15803 15860 15889 15913 15929 15949 16163 16191 16203 16212 16217 16226 16263 16269 16312 16318 16496 16536 16568 16664 16703 16999 17028 17345 17475 17485 17494 17565 17631 17680 17696 17798 17911 17963 17966 18028 18136 18212 18237 18251 18288 18386 18430 18594 18595 18717 18718 18793 18956 19219 19232 19254 19691 19708 19742)
    ~ Adreno (TM) 740: 14 of 24 (17447 17914 18192 18342 18462 18957 18960 19058 19185 19217 19218 19319 19600 19625; ok: 18498 18500 18742 18853 19158 19184 19196 19248 19383 19593)
    ~ Apple M1: 16 of 45 (16632 16913 17007 17148 17150 18000 18072 18656 18733 18784 18934 19110 19135 19198 19503 19606; ok: 11048 11395 11396 11632 11689 11884 12086 13000 13410 13597 14080 14169 14584 14630 14927 15137 15250 15281 15337 15338 15517 15518 15671 15750 15791 15937 16158 16446 16832)
    ~ Apple M1 Max: 13 of 21 (16663 16843 16914 17767 17845 18302 18340 18407 18624 18895 18971 19317 19632; ok: 13018 14522 14673 14752 15018 15815 16220 16557)
    ~ Apple M1 Pro: 11 of 13 (16934 16989 17277 17872 18152 18537 18565 18691 18749 18975 19041; ok: 13606 14001)
    ~ Apple M2: 8 of 9 (16915 17860 17895 18216 18787 19113 19223 19510; ok: 19464)
    ~ Intel(R) HD Graphics 630: 2 of 81 (17784 18668; ok: 2074 2145 2482 2514 2519 2963 3117 3284 3389 3669 3955 4202 4268 4465 4478 4655 5015 5169 5423 5453 5583 5781 5883 6156 6221 6371 6614 6673 6795 6874 6933 7091 7183 7229 7303 7350 7439 7521 7609 7734 7797 8344 8525 8596 8700 8817 8982 9253 9330 9364 9508 9678 9740 9793 10151 10280 10356 10615 10794 11412 11414 11587 12156 12389 13044 14077 14286 14772 14840 15263 15692 16053 16237 16644 17048 17360 18529 18538 19057)
    ~ Intel(R) Iris(TM) Plus Graphics: 5 of 7 (17065 17132 17350 18611 19575; ok: 11877 13745)
    ~ Intel(R) Iris(TM) Plus Graphics 640: 1 of 7 (17618; ok: 1732 13263 13264 13266 13756 15516)
    ~ Intel(R) Iris(TM) Plus Graphics 645: 2 of 4 (17555 18861; ok: 13596 15806)
    ~ Intel(R) UHD Graphics 630: 7 of 127 (16996 17880 18286 18401 18810 18877 19497; ok: 2796 3114 3168 3346 3744 3817 4385 4513 4686 4890 4919 5285 5639 6022 6050 6181 6187 6543 6601 6649 6811 6923 6991 7024 7092 7185 7298 7518 7748 7956 8100 8267 8569 8580 8586 8619 8792 8857 8984 9003 9204 9322 9406 9420 9440 9458 9465 9578 9601 9732 9839 10081 10096 10165 10227 10257 10388 10745 11153 11622 11817 11913 11925 12149 12249 12433 12597 12657 12670 12711 12762 12835 12848 12975 13029 13067 13252 13407 13417 13576 14040 14127 14398 14427 14591 14605 14712 14968 15125 15297 15345 15562 15564 15712 15939 16164 16378 16721 16770 16783 16811 16856 17175 17456 17634 17679 17685 17765 18095 18181 18184 18414 18508 18728 18850 19090 19253 19387 19437 19547)
    ~ Mali-G52: 2 of 35 (17649 17729; ok: 9942 10194 11056 11982 12046 12062 12284 12622 12630 12643 12798 14045 14925 15260 15378 15727 15736 15996 16385 16469 16506 16992 17167 17176 17292 18106 18134 18226 18280 18582 18589 18744 19671)
    ~ Mali-G710: 1 of 3 (19220; ok: 16769 16823)
    ~ Mali-G78: 2 of 30 (17843 19077; ok: 10390 10395 10625 10699 11648 12368 12788 12851 13091 13271 13582 14501 14514 15142 15143 15551 15603 16031 16143 16265 16311 16441 17127 17427 17487 18249 18399 19270)
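
For illustration only: one way the three per-stage features could be derived from the two Vulkan properties queried above. The stage bit values are those of VkShaderStageFlagBits (vertex 0b1, fragment 0b10000, compute 0b100000); the function and the mapping itself are assumptions of this sketch, not the PR's implementation.

```rust
// (feature name, VkShaderStageFlagBits value) pairs. The feature names are
// the ones proposed in this PR; pairing them with stage bits this way is an
// assumption of this sketch.
const STAGE_FEATURES: [(&str, u32); 3] = [
    ("SUBGROUP_COMPUTE", 0b100000),
    ("SUBGROUP_FRAGMENT", 0b010000),
    ("SUBGROUP_VERTEX", 0b000001),
];

// BASIC | VOTE | ARITHMETIC | BALLOT | SHUFFLE | SHUFFLE_RELATIVE.
const REQUIRED_OPERATIONS: u32 = 0b111111;

/// Hypothetical helper: lists the per-stage features a device could expose,
/// given its subgroupSupportedOperations and subgroupSupportedStages masks.
fn subgroup_features(supported_operations: u32, supported_stages: u32) -> Vec<&'static str> {
    if (supported_operations & REQUIRED_OPERATIONS) != REQUIRED_OPERATIONS {
        return Vec::new();
    }
    STAGE_FEATURES
        .iter()
        .filter(|&&(_, bit)| (supported_stages & bit) != 0)
        .map(|&(name, _)| name)
        .collect()
}
```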

Support on Vulkan

Support for these subgroup operations in compute shaders seems pretty universal among devices that already meet WebGPU minimum requirements. Fragment and vertex subgroup operations also seem widely supported, if you have the latest drivers.

Support on older Apple GPUs

The Apple A13 GPU is interesting in that it supports shuffle operations but not reduction operations. In theory one could provide an implementation of reductions in terms of shuffle (permute in MSL) operations, as Mesa does for older AMD GPUs, though I'll leave that exercise to someone else 😆.
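
For illustration only: the idea of building a reduction out of shuffles can be sketched on the CPU as a butterfly of shuffle-xor rounds. This assumes all lanes are active and a power-of-two subgroup size; it is not a claim about how Mesa does it, and the function name is hypothetical.

```rust
/// Butterfly reduction: in each round, lane i adds the value held by lane
/// (i XOR step), with step = 1, 2, 4, ... After log2(n) rounds every lane
/// holds the sum of all lanes.
/// Assumes every lane is active and the lane count is a power of two.
fn reduce_add_via_shuffle_xor(values: &[u32]) -> Vec<u32> {
    assert!(values.len().is_power_of_two());
    let mut current = values.to_vec();
    let mut step = 1;
    while step < current.len() {
        // One shuffle-xor round: each lane reads its partner's value...
        let shuffled: Vec<u32> = (0..current.len())
            .map(|lane| current[lane ^ step])
            .collect();
        // ...and then adds it to its own.
        for (own, partner) in current.iter_mut().zip(shuffled) {
            *own += partner;
        }
        step *= 2;
    }
    current
}
```

A real emulation would also have to handle inactive lanes, which this sketch deliberately leaves out.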

On MoltenVK, the A12 and A11 also advertise subgroup operations other than reductions, but report a subgroup size of 4, so really these are just quad operations that MoltenVK is presenting as subgroup operations. These GPUs do not support any real subgroup operations, so we couldn't do the emulation above either.

A separate group of flags for quad operations may be useful for users wanting to implement algorithms targeting these older devices.

Vulkan devices supporting the above operations but not quad operations
Requirement "subgroupSupportedOperations has bits 0b10000000" loses 9 (and partially loses 2) further deviceNames:
  In ALL reports (9 deviceNames):
    x PowerVR B-Series BXE-4-32: 1 (18067)
    x Turnip Adreno (TM) 610: 1 (19343)
    x Turnip Adreno (TM) 618: 4 (14703 19399 19409 19675)
    x Turnip Adreno (TM) 619: 1 (16229)
    x Turnip Adreno (TM) 630: 8 (14614 15190 15899 16477 16537 16670 19167 19192)
    x Turnip Adreno (TM) 640: 12 (14550 14798 15940 16124 16511 16629 16848 16853 16881 16945 17877 19410)
    x Turnip Adreno (TM) 650: 50 (14543 14612 14702 14795 14920 14931 14940 15470 15779 15969 16012 16013 16039 16054 16198 16232 16234 16346 16347 16400 16460 16508 16510 16513 16546 16599 16609 16656 16718 16738 17011 17012 17114 17135 17187 17431 17495 17984 18014 18015 18361 18481 18519 18909 18992 19026 19152 19275 19369 19505)
    x Turnip Adreno (TM) 740: 1 (19591)
    x Turnip Adreno (TM) 7Adreno (TM) 740SX_SECOND_VECTOR: 1 (19289)
  In SOME reports (2 deviceNames):
    ~ SwiftShader Device: 2 of 8 (14246 19306; ok: 16262 16766 18449 18451 18694 18695)
    ~ llvmpipe: 35 of 212 (15092 15285 15922 15974 16022 16064 16068 16116 16139 16149 16173 16180 16216 16240 16244 16250 16272 16299 16337 16340 16370 16374 16384 16395 16420 16423 16467 16472 16481 16486 16495 16502 16519 16534 16554; ok: 15489 15568 15639 15695 15799 15809 15811 15847 15990 16041 16044 16069 16103 16119 16192 16268 16277 16349 16457 16504 16525 16561 16572 16581 16598 16627 16643 16673 16678 16699 16702 16725 16780 16798 16816 16838 16852 16866 16878 16892 16909 16939 16964 16975 17020 17041 17054 17069 17075 17103 17124 17162 17185 17197 17215 17243 17257 17301 17353 17378 17394 17424 17445 17483 17503 17509 17541 17569 17601 17617 17672 17708 17727 17745 17763 17802 17807 17822 17849 17871 17899 17926 17961 17971 17989 17993 17996 18004 18008 18047 18061 18089 18093 18117 18130 18151 18155 18160 18179 18222 18228 18246 18258 18301 18309 18358 18418 18424 18434 18470 18492 18495 18506 18535 18548 18555 18557 18558 18561 18597 18633 18638 18665 18671 18689 18708 18726 18737 18779 18821 18827 18836 18847 18858 18880 18885 18906 18913 18937 18949 18963 18984 19008 19023 19032 19054 19064 19072 19087 19100 19107 19142 19156 19251 19264 19280 19315 19324 19350 19363 19414 19418 19440 19461 19491 19545 19550 19564 19596 19599 19634 19640 19662 19684 19696 19732 19736)
Vulkan devices supporting the above operations but not clustered operations
Requirement "subgroupSupportedOperations has bits 0b1000000" loses 27 (and partially loses 20) further deviceNames:
  In ALL reports (27 deviceNames):
    x AMD Radeon Pro 455: 1 (19237)
    x AMD Radeon Pro 5500 XT: 1 (19703)
    x AMD Radeon Pro 555: 2 (19182 19721)
    x AMD Radeon Pro 560: 3 (16847 17783 18667)
    x AMD Radeon Pro 570: 1 (18186)
    x AMD Radeon Pro Vega 20: 2 (16905 17497)
    x AMD Radeon Pro Vega 48: 1 (18502)
    x AMD Radeon Pro W5700X: 1 (18292)
    x AMD Radeon RX 570: 1 (17293)
    x Apple M1 Ultra: 4 (17092 17381 17879 18338)
    x Apple M2 Max: 1 (18789)
    x Apple M2 Pro: 1 (19130)
    x Intel(R) Iris(TM) Plus Graphics 650: 1 (16944)
    x Intel(R) Iris(TM) Plus Graphics 655: 1 (19338)
    x PowerVR B-Series BXE-4-32: 1 (18067)
    x SwiftShader Device: 8 (14246 16262 16766 18449 18451 18694 18695 19306)
    x Turnip Adreno (TM) 610: 1 (19343)
    x Turnip Adreno (TM) 618: 4 (14703 19399 19409 19675)
    x Turnip Adreno (TM) 619: 1 (16229)
    x Turnip Adreno (TM) 630: 8 (14614 15190 15899 16477 16537 16670 19167 19192)
    x Turnip Adreno (TM) 640: 12 (14550 14798 15940 16124 16511 16629 16848 16853 16881 16945 17877 19410)
    x Turnip Adreno (TM) 650: 50 (14543 14612 14702 14795 14920 14931 14940 15470 15779 15969 16012 16013 16039 16054 16198 16232 16234 16346 16347 16400 16460 16508 16510 16513 16546 16599 16609 16656 16718 16738 17011 17012 17114 17135 17187 17431 17495 17984 18014 18015 18361 18481 18519 18909 18992 19026 19152 19275 19369 19505)
    x Turnip Adreno (TM) 660: 19 (15328 15722 15782 16006 16016 16189 16455 16514 16515 16621 16668 16808 16947 17083 17095 17163 18888 18896 19339)
    x Turnip Adreno (TM) 690: 1 (19161)
    x Turnip Adreno (TM) 740: 1 (19591)
    x Turnip Adreno (TM) 7Adreno (TM) 740SX_SECOND_VECTOR: 1 (19289)
    x llvmpipe: 212 (15092 15285 15489 15568 15639 15695 15799 15809 15811 15847 15922 15974 15990 16022 16041 16044 16064 16068 16069 16103 16116 16119 16139 16149 16173 16180 16192 16216 16240 16244 16250 16268 16272 16277 16299 16337 16340 16349 16370 16374 16384 16395 16420 16423 16457 16467 16472 16481 16486 16495 16502 16504 16519 16525 16534 16554 16561 16572 16581 16598 16627 16643 16673 16678 16699 16702 16725 16780 16798 16816 16838 16852 16866 16878 16892 16909 16939 16964 16975 17020 17041 17054 17069 17075 17103 17124 17162 17185 17197 17215 17243 17257 17301 17353 17378 17394 17424 17445 17483 17503 17509 17541 17569 17601 17617 17672 17708 17727 17745 17763 17802 17807 17822 17849 17871 17899 17926 17961 17971 17989 17993 17996 18004 18008 18047 18061 18089 18093 18117 18130 18151 18155 18160 18179 18222 18228 18246 18258 18301 18309 18358 18418 18424 18434 18470 18492 18495 18506 18535 18548 18555 18557 18558 18561 18597 18633 18638 18665 18671 18689 18708 18726 18737 18779 18821 18827 18836 18847 18858 18880 18885 18906 18913 18937 18949 18963 18984 19008 19023 19032 19054 19064 19072 19087 19100 19107 19142 19156 19251 19264 19280 19315 19324 19350 19363 19414 19418 19440 19461 19491 19545 19550 19564 19596 19599 19634 19640 19662 19684 19696 19732 19736)
  In SOME reports (20 deviceNames):
    ~ AMD Radeon Pro 5300M: 4 of 9 (17089 18227 18809 19516; ok: 10099 10103 12130 13339 13860)
    ~ AMD Radeon Pro 5500M: 5 of 6 (17321 18370 18402 18876 19496; ok: 12173)
    ~ AMD Radeon Pro 555X: 1 of 5 (18851; ok: 12248 12834 13632 18849)
    ~ AMD Radeon Pro 560X: 3 of 4 (16995 17003 18525; ok: 16782)
    ~ AMD Radeon Pro 580X: 1 of 2 (19109; ok: 10495)
    ~ AMD Radeon RX 580: 1 of 2 (18921; ok: 11047)
    ~ AMD Radeon RX 6600: 2 of 11 (18663 19345; ok: 12796 13253 14295 14943 15928 17339 17513 19296 19530)
    ~ AMD Radeon RX 6800 XT: 1 of 21 (18291; ok: 10010 10710 11111 11563 11990 12279 12889 13194 13613 14095 15060 15613 16040 16352 16880 17741 18043 18512 18661 19171)
    ~ AMD Radeon RX Vega 64: 1 of 18 (18263; ok: 5659 6898 9626 9630 9646 10079 10245 11293 11294 12289 13161 14191 14837 15034 15539 15540 16135)
    ~ Adreno (TM) 730: 9 of 88 (17654 17840 17841 18073 18090 18530 19101 19382 19709; ok: 13441 13717 13731 13905 14181 14215 14220 14270 14277 14397 14481 14487 14511 14524 14905 14988 15402 15434 15453 15574 15646 15760 15771 15781 15803 15860 15889 15913 15929 15949 16163 16191 16203 16212 16217 16226 16263 16269 16312 16318 16496 16536 16568 16664 16703 16999 17028 17345 17475 17485 17494 17565 17631 17680 17696 17798 17911 17963 17966 18028 18136 18212 18237 18251 18288 18386 18430 18594 18595 18717 18718 18793 18956 19219 19232 19254 19691 19708 19742)
    ~ Adreno (TM) 740: 14 of 24 (17447 17914 18192 18342 18462 18957 18960 19058 19185 19217 19218 19319 19600 19625; ok: 18498 18500 18742 18853 19158 19184 19196 19248 19383 19593)
    ~ Apple M1: 16 of 45 (16632 16913 17007 17148 17150 18000 18072 18656 18733 18784 18934 19110 19135 19198 19503 19606; ok: 11048 11395 11396 11632 11689 11884 12086 13000 13410 13597 14080 14169 14584 14630 14927 15137 15250 15281 15337 15338 15517 15518 15671 15750 15791 15937 16158 16446 16832)
    ~ Apple M1 Max: 13 of 21 (16663 16843 16914 17767 17845 18302 18340 18407 18624 18895 18971 19317 19632; ok: 13018 14522 14673 14752 15018 15815 16220 16557)
    ~ Apple M1 Pro: 11 of 13 (16934 16989 17277 17872 18152 18537 18565 18691 18749 18975 19041; ok: 13606 14001)
    ~ Apple M2: 8 of 9 (16915 17860 17895 18216 18787 19113 19223 19510; ok: 19464)
    ~ Intel(R) HD Graphics 630: 2 of 81 (17784 18668; ok: 2074 2145 2482 2514 2519 2963 3117 3284 3389 3669 3955 4202 4268 4465 4478 4655 5015 5169 5423 5453 5583 5781 5883 6156 6221 6371 6614 6673 6795 6874 6933 7091 7183 7229 7303 7350 7439 7521 7609 7734 7797 8344 8525 8596 8700 8817 8982 9253 9330 9364 9508 9678 9740 9793 10151 10280 10356 10615 10794 11412 11414 11587 12156 12389 13044 14077 14286 14772 14840 15263 15692 16053 16237 16644 17048 17360 18529 18538 19057)
    ~ Intel(R) Iris(TM) Plus Graphics: 5 of 7 (17065 17132 17350 18611 19575; ok: 11877 13745)
    ~ Intel(R) Iris(TM) Plus Graphics 640: 1 of 7 (17618; ok: 1732 13263 13264 13266 13756 15516)
    ~ Intel(R) Iris(TM) Plus Graphics 645: 2 of 4 (17555 18861; ok: 13596 15806)
    ~ Intel(R) UHD Graphics 630: 7 of 127 (16996 17880 18286 18401 18810 18877 19497; ok: 2796 3114 3168 3346 3744 3817 4385 4513 4686 4890 4919 5285 5639 6022 6050 6181 6187 6543 6601 6649 6811 6923 6991 7024 7092 7185 7298 7518 7748 7956 8100 8267 8569 8580 8586 8619 8792 8857 8984 9003 9204 9322 9406 9420 9440 9458 9465 9578 9601 9732 9839 10081 10096 10165 10227 10257 10388 10745 11153 11622 11817 11913 11925 12149 12249 12433 12597 12657 12670 12711 12762 12835 12848 12975 13029 13067 13252 13407 13417 13576 14040 14127 14398 14427 14591 14605 14712 14968 15125 15297 15345 15562 15564 15712 15939 16164 16378 16721 16770 16783 16811 16856 17175 17456 17634 17679 17685 17765 18095 18181 18184 18414 18508 18728 18850 19090 19253 19387 19437 19547)

@exrook exrook changed the title Add feature for subgroup operations in fragment and compute shaders Add features for subgroup operations in shaders Oct 24, 2023
@exrook
Contributor Author

exrook commented Oct 24, 2023

Interesting that the mac CI is now failing with a shader compiler error on MoltenVK after this reorganization:

[mvk-error] VK_ERROR_INITIALIZATION_FAILED: Compute pipeline compile failed (Error code 3):
Compiler encountered an internal error.

The MoltenVK test was previously being skipped.

@exrook exrook force-pushed the subgroup_feature branch 3 times, most recently from 08227ef to 21549e6 Compare October 24, 2023 18:59
@exrook
Contributor Author

exrook commented Oct 24, 2023

Forcing MoltenVK debug on when running the test gets it to print out the translated MSL, which seems to compile fine when pasted into shader-playground, so I've got no idea what's going wrong here :)

@cwfitzgerald cwfitzgerald requested a review from a team November 15, 2023 21:46
Member

@cwfitzgerald cwfitzgerald left a comment

Small changes from wgpu side

Please re-request a review from me once the changes are addressed to make sure I see it!

tests/tests/subgroup_operations/shader.wgsl Outdated
tests/tests/subgroup_operations/shader.wgsl Outdated
@exrook exrook requested a review from cwfitzgerald November 21, 2023 13:51
CHANGELOG.md Outdated
@exrook
Contributor Author

exrook commented Nov 22, 2023

The control flow tests immediately pay off with a failure on Metal :)

Minimized example:

WGSL source
@group(0)
@binding(0)
var<storage, read_write> storage_buffer: array<u32>;

@compute
@workgroup_size(128)
fn main(
    @builtin(global_invocation_id) global_id: vec3<u32>,
    @builtin(subgroup_size) subgroup_size: u32,
    @builtin(subgroup_invocation_id) subgroup_invocation_id: u32,
) {
    var value = 0u;

    if subgroup_invocation_id % 2u == 0u {
        value = subgroupAdd(1u);
    } else {
        value = subgroup_size / 2u;
    }

    storage_buffer[global_id.x] = value;
}

naga MSL output
// language: metal1.0
#include <metal_stdlib>
#include <simd/simd.h>

using metal::uint;

struct _mslBufferSizes {
    uint size0;
};

typedef uint type_1[1];

struct main_Input {
};
kernel void main_(
  metal::uint3 global_id [[thread_position_in_grid]]
, uint subgroup_size [[threads_per_simdgroup]]
, uint subgroup_invocation_id [[thread_index_in_simdgroup]]
, device type_1& storage_buffer [[user(fake0)]]
, constant _mslBufferSizes& _buffer_sizes [[user(fake0)]]
) {
    uint value = 0u;
    if ((subgroup_invocation_id % 2u) == 0u) {
        uint unnamed = metal::simd_sum(1u);
        value = unnamed;
    } else {
        value = subgroup_size / 2u;
    }
    uint _e16 = value;
    storage_buffer[global_id.x] = _e16;
    return;
}

This program results in a storage buffer with contents [32, 16, 32, 16, 32, 16, ...]. That is, all 32 threads participate in subgroupAdd(1u), even though, as written, only the 16 even-numbered threads should be active and participate.

Interestingly, moving the store inside the if as shown below produces the expected output of [16, 16, 16, 16, ...]:

    if subgroup_invocation_id % 2u == 0u {
        storage_buffer[global_id.x] = subgroupAdd(1u);
    } else {
        storage_buffer[global_id.x] = subgroup_size/2u;
    }

Additionally, adding a dummy store to one of the branches also forces the correct behavior:

    if subgroup_invocation_id % 2u == 0u {
        storage_buffer[global_id.x] = 0u;
        value = subgroupAdd(1u);
    } else {
        value = subgroup_size/2u;
    }
    storage_buffer[global_id.x] = value;

Replacing subgroupAdd(1u) with subgroupBallot().x in these test cases produces what you would expect given the behavior we're observing. In the examples that produce a sum of 32 inside the branch, the lane mask is fully active (0xffffffff), but the examples with a store inside the branch produce a half-active mask of 0x55555555.
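
To make the two masks concrete, here is a small CPU-side model in Rust. It is only a sketch: `ballot_mask` and the two wrappers are hypothetical helpers, not part of wgpu or naga, and a 32-lane subgroup is assumed. Bit i of the mask is set when lane i is active at the point of the ballot.

```rust
/// Hypothetical CPU model of a ballot over a 32-lane subgroup:
/// bit `i` of the result is set when lane `i` is active.
fn ballot_mask(is_active: impl Fn(u32) -> bool) -> u32 {
    (0..32u32)
        .filter(|&lane| is_active(lane))
        .fold(0u32, |mask, lane| mask | (1u32 << lane))
}

/// Mask expected inside `if subgroup_invocation_id % 2u == 0u`:
/// only the even lanes are active.
fn even_lanes_mask() -> u32 {
    ballot_mask(|lane| lane % 2 == 0)
}

/// Mask seen when the ballot is effectively evaluated outside the branch:
/// every lane is active.
fn all_lanes_mask() -> u32 {
    ballot_mask(|_| true)
}
```

Under these assumptions `even_lanes_mask()` is 0x55555555 (16 bits set) and `all_lanes_mask()` is 0xffffffff, matching the two values reported above.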

Given that the Metal shaders we produce seem to be correct as far as I can tell, this looks like a bug in Metal, where the compiler incorrectly reorders subgroup operations with respect to our desired control flow, producing control flow similar to the following:

    var value = 0u;
    let sum = subgroupAdd(1u);
    let size_half = subgroup_size / 2u;
    if subgroup_invocation_id % 2u == 0u {
        value = sum;
    } else {
        value = size_half;
    }
    storage_buffer[global_id.x] = value;
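
The difference between the two orderings can be checked with a CPU-side model. This is a sketch under stated assumptions, not naga or Metal code: the subgroup is assumed to have 32 lanes, and `subgroup_add`, `as_written`, and `hoisted` are hypothetical names introduced here for illustration.

```rust
const SUBGROUP_SIZE: u32 = 32;

/// Sum of `value` over the lanes for which `is_active` holds, i.e. the
/// result a subgroup add should return to each participating lane.
fn subgroup_add(value: u32, is_active: impl Fn(u32) -> bool) -> u32 {
    (0..SUBGROUP_SIZE)
        .filter(|&lane| is_active(lane))
        .map(|_| value)
        .sum()
}

/// Semantics as written in the WGSL: the add runs inside the branch,
/// so only the even lanes participate.
fn as_written() -> Vec<u32> {
    (0..SUBGROUP_SIZE)
        .map(|lane| {
            if lane % 2 == 0 {
                subgroup_add(1, |l| l % 2 == 0)
            } else {
                SUBGROUP_SIZE / 2
            }
        })
        .collect()
}

/// Semantics after hoisting: the add runs before the branch,
/// with every lane participating.
fn hoisted() -> Vec<u32> {
    let sum = subgroup_add(1, |_| true);
    let size_half = SUBGROUP_SIZE / 2;
    (0..SUBGROUP_SIZE)
        .map(|lane| if lane % 2 == 0 { sum } else { size_half })
        .collect()
}
```

In this model `as_written()` yields 16 in every lane, while `hoisted()` yields the alternating 32, 16 pattern observed on Metal.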

I'm not really an Apple user, but maybe someone else knows how we could report this behavior as a bug against Metal?

@exrook
Contributor Author

exrook commented Nov 22, 2023

Interestingly, inserting a subgroupBarrier() inside the branch also seems to force the expected behavior, even though the Metal spec says of simdgroup_barrier that "if any thread enters the conditional statement and executes the barrier function, then all threads in the [SIMD-group] need to enter the conditional and execute the barrier function" (MSL Spec v3.1, p. 171).

@Lichtso
Contributor

Lichtso commented Nov 22, 2023

Putting any global memory operation in a conditional branch probably introduces a memory barrier under the hood. And yes, figuring out the precise semantics of control flow and memory barriers will be tough, especially because it seems to me that the developers of the underlying APIs did not think them through either. I already hinted at the barrier problem in the official spec PR, because that proposal is currently missing barriers.

@cwfitzgerald
Member

Not much we can do about it if Metal is miscompiling even the most basic of shaders.

This sounds like it should be an expected failure with a corresponding issue. Preferably the infra doing the testing will assume that metal will be wrong, based on the expected miscompilation.

Member

@cwfitzgerald cwfitzgerald left a comment

Seems we still have some CI failures

@JMS55
Contributor

JMS55 commented Jan 12, 2024

@exrook are you planning to finish this PR? I would love to be able to use subgroups.

@exrook
Contributor Author

exrook commented Jan 24, 2024

@JMS55 here's an update of the status of this PR as I see it

The CI failures are due to bugs in the backends' implementations of subgroups (lavapipe & metal) that are unrelated to our mapping from wgsl. I've done some investigation into the underlying causes but wasn't able to find anything. I'll disable the triggering tests for now, I don't believe they should block this getting merged.

That being said I believe the Naga side of this PR has still not been reviewed, so that's still a blocker for merging.

We could also consider updating the function names to match the draft WebGPU subgroups proposal, though personally I prefer the names already used in this PR for a few of the operations that differ. This PR also does not currently implement the quad operations specified in the proposal. I don't have a need for them myself, but if someone would like to use them I can add them.

@cwfitzgerald
Member

> The CI failures are due to bugs in the backends implementation of subgroups (lavapipe & metal) that are unrelated to our mapping from wgsl

What's the issue on lavapipe? I would expect lavapipe to be conformant.

> I'll disable the triggering tests for now, I don't believe they should block this getting merged.

All test expectations should be as narrowly scoped as possible, so we assert that we will get this exact result on lavapipe and on metal.
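
One way to read "narrowly scoped" is that the test asserts the exact known-wrong buffer contents on the affected backend rather than being skipped there. The sketch below is a self-contained model of that idea, not the wgpu test infrastructure: `Backend`, `expected_contents`, and `check` are hypothetical names, only one subgroup is modeled, and only the Metal miscompilation is encoded because the lavapipe failure is not described in this thread.

```rust
/// Hypothetical backend tag; the real test infrastructure has its own types.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Backend {
    Vulkan,
    Metal,
}

/// Exact contents expected for one subgroup of the minimized control-flow
/// test. Assumes a full subgroup with an even number of lanes.
fn expected_contents(backend: Backend, subgroup_size: u32) -> Vec<u32> {
    match backend {
        // Known miscompilation: even lanes see a sum over every lane.
        Backend::Metal => (0..subgroup_size)
            .map(|lane| if lane % 2 == 0 { subgroup_size } else { subgroup_size / 2 })
            .collect(),
        // Elsewhere the shader is expected to behave as written.
        Backend::Vulkan => vec![subgroup_size / 2; subgroup_size as usize],
    }
}

/// Fails when the backend deviates from its recorded expectation, including
/// when a known-bad backend starts producing the correct result.
fn check(backend: Backend, subgroup_size: u32, actual: &[u32]) -> Result<(), String> {
    let expected = expected_contents(backend, subgroup_size);
    if actual == expected.as_slice() {
        Ok(())
    } else {
        Err(format!("{backend:?}: expected {expected:?}, got {actual:?}"))
    }
}
```

With this shape, a fixed Metal compiler would make `check(Backend::Metal, ..)` fail, which signals that the expectation is stale and can be removed.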

@Lichtso
Contributor

Lichtso commented Jan 24, 2024

> This PR also does not currently implement the quad operations specified in the proposal

We are missing clustered operations, and if I am not mistaken quad operations (as they can be used outside of fragment shaders) are just a special case of 4-way-clustered operations. IIRC, metal has no general clustered operations (other than quads), so I am ok with leaving them out for now.
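
As a rough illustration of the "special case" claim, the following Rust sketch models a clustered reduction on the CPU and derives a quad reduction from it by fixing the cluster size to 4. The names `clustered_add` and `quad_add` are hypothetical, all lanes are assumed active, and only a reduction is modeled (quad swaps and broadcasts are not covered).

```rust
/// CPU model of a clustered add: lanes are split into consecutive clusters
/// of `cluster_size`, and each lane receives the sum over its own cluster.
fn clustered_add(values: &[u32], cluster_size: usize) -> Vec<u32> {
    assert!(cluster_size > 0 && values.len() % cluster_size == 0);
    values
        .chunks(cluster_size)
        .flat_map(|cluster| {
            let sum: u32 = cluster.iter().sum();
            // Every lane in the cluster sees the same reduced value.
            std::iter::repeat(sum).take(cluster.len())
        })
        .collect()
}

/// A quad reduction expressed as a clustered reduction with cluster size 4.
fn quad_add(values: &[u32]) -> Vec<u32> {
    clustered_add(values, 4)
}
```

For example, `quad_add(&[1, 2, 3, 4, 5, 6, 7, 8])` gives 10 for the first four lanes and 26 for the last four, and a cluster size equal to the subgroup size degenerates to an ordinary subgroup add.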

The differences from the WebGPU draft are:

  • Naming Add / Mul vs. Sum / Product
  • They are indecisive about num_subgroups and subgroup_id
  • We are missing the quad operations
  • We have exclusive and inclusive prefix scan operations, they only have the exclusive ones
  • They additionally have subgroupElect(), which is equivalent to subgroupBroadcastFirst(subgroup_invocation_id) == subgroup_invocation_id
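
Two of the points in this list can be sketched with a CPU-side model in Rust: the elect equivalence, and the fact that an inclusive scan can be derived from an exclusive one by adding each lane's own value. All function names here are hypothetical, lanes are modeled as slice indices, and inactive lanes simply report false for elect (on real hardware they produce no result at all).

```rust
/// Value held by the lowest-numbered active lane, if any lane is active.
fn broadcast_first(values: &[u32], active: &[bool]) -> Option<u32> {
    values
        .iter()
        .zip(active)
        .find(|&(_, &is_active)| is_active)
        .map(|(&value, _)| value)
}

/// Elect modeled as `broadcast_first(lane_id) == lane_id`:
/// true only on the first active lane.
fn elect(active: &[bool]) -> Vec<bool> {
    let ids: Vec<u32> = (0..active.len() as u32).collect();
    let first = broadcast_first(&ids, active);
    ids.iter()
        .zip(active)
        .map(|(&id, &is_active)| is_active && first == Some(id))
        .collect()
}

/// Exclusive prefix sum: lane `i` receives the sum of lanes `0..i`.
fn exclusive_add(values: &[u32]) -> Vec<u32> {
    values
        .iter()
        .scan(0u32, |acc, &v| {
            let before = *acc;
            *acc += v;
            Some(before)
        })
        .collect()
}

/// Inclusive prefix sum, derived from the exclusive one by adding each
/// lane's own value.
fn inclusive_add(values: &[u32]) -> Vec<u32> {
    exclusive_add(values)
        .iter()
        .zip(values)
        .map(|(&excl, &own)| excl + own)
        .collect()
}
```

For the input [1, 2, 3, 4] the exclusive scan is [0, 1, 3, 6] and the inclusive scan is [1, 3, 6, 10]; with lanes 1 and 2 active, only lane 1 is elected.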

@teoxoy
Member

teoxoy commented Jan 26, 2024

Thanks for the summary, I will try to review this next week.
It would be great if you could rebase/merge trunk into it.

@teoxoy teoxoy self-requested a review January 26, 2024 16:35
@JMS55
Contributor

JMS55 commented Jan 27, 2024

Glad to see movement on this PR!

@JMS55
Contributor

JMS55 commented Feb 6, 2024

As a followup to this PR, it would also be great to get a cooperative matrix multiplication extension, for neural net applications like burn.

@Lichtso
Contributor

Lichtso commented Feb 24, 2024

https://github.com/Lichtso/wgpu/tree/subgroup_feature

@teoxoy I merged trunk into it as requested and fixed the tests. So it is ready for review now.

@cwfitzgerald
Member

@Lichtso could you submit a new PR against that branch, and we'll go from there?

@Lichtso Lichtso mentioned this pull request Feb 25, 2024
Development

Successfully merging this pull request may close these issues.

Support subgroup/wave operations
7 participants